\documentclass{article}
\usepackage{url,graphicx}

\newcommand{\BEASTVersion}{2.0}
\newcommand{\TracerVersion}{1.5}
\newcommand{\FigTreeVersion}{1.3.1}

\begin{document}
\title{Phylogeography with BEAST 2.0}
\author{Remco Bouckaert \url{remco@cs.auckland.ac.nz}}
\maketitle

\section{Introduction}


In this tutorial we describe a full Bayesian framework for phylogeography developed in  \cite{Lemey:2009uq}.
 
You will need the following software at your disposal:

\begin{itemize}

\item {\bf BEAST} - this package contains the BEAST program, BEAUti, TreeAnnotator and other utility programs. This tutorial is written for BEAST v{\BEASTVersion}, which has support for multiple partitions. It is available for download from \\* \texttt{http://beast2.cs.auckland.ac.nz/}.
\item {\bf Tracer} - this program is used to explore the output of BEAST (and other Bayesian MCMC programs). It graphically and
quantitively summarizes the distributions of continuous parameters and provides diagnostic information. At the time of
writing, the current version is v{\TracerVersion}. It is available for download from \texttt{http://beast.bio.ed.ac.uk/}.
\item {\bf FigTree} - this is an application for displaying and printing molecular phylogenies, in particular those obtained using
BEAST. At the time of writing, the current version is v{\FigTreeVersion}. It is available for download from \texttt{http://tree.bio.ed.ac.uk/}.
\item {\bf Spread} for summarysing the geographic spread in a KML file (available from \url{http://www.kuleuven.ac.be/aidslab/phylogeography/SPREAD.html}.
\item {\bf google-earth} for displaying the KML file (just google for it, if you have not already have it installed).
\end{itemize}


This tutorial guides you through a continuous phylogegraphy analysis of a rabies epidemic spread though a raccoon population from the north-east of the united states. The original data comes from
\cite{biek:2007hi} and was used in a phylogeographic analysis before in \cite{Lemey:2010hc}. It consists of 47 sequences of 2811 characters.

We go through the following steps:
\begin{itemize}
\item The first step is to install the beast-classic add-on that contains the phylogegraphic models. 
\item Then, we use BEAUti to set up the analysis, and write it to an XML file.
\item We use BEAST to run the MCMC analysis based on the XML file.
\item The results will need to be checked for convergence using Tracer.
\item Finally, we post-process the output of the MCMC so we can visualise the geographical dispersion.
\end{itemize}

\subsubsection*{Set up BEAUti for phylogeography}

Phylogeography as described in this tutorial is part of the {\tt beast-classic} add-on.
If you not already have it installed, you can
install the add-on through BEAUti. Start BEAUti by double clicking on its icon. 
Select the File/Manage Add-ons menu. A dialog pops up showing the add-ons already installed. 

\begin{center}
\includegraphics[scale=0.4]{figures/addonmgr.png}
\end{center}

Select the BEAST-CLASSIC entry in the list, and click the Install button. After a little while the dialog is updated and it shows that the add-on is now installed.
BEAST-CLASSIC requires BEASTii, so if you have not already installed BEASTii, an error message may be shown warning that BEASTii should be installed as well.

\begin{center}
\includegraphics[scale=0.4]{figures/addonmgr2.png}
\end{center}


\subsection*{BEAUti}


\subsubsection*{Loading the NEXUS file }

To load a NEXUS format alignment, simply select the \texttt{Import
Alignment} option from the File menu: 

Select the file called \texttt{RacRABV.nex}, which is located in the place where the beast-classic add-on is installed 
under the {\tt examples/nexus} directory. Typically, add-ons are installed in different places, depending on your operating system:
\begin{itemize}
\item for Windows, in your home directory (e.g. {\tt c:$\backslash$Users$\backslash$joe}) under the {\tt BEAST} directory,
\item for Mac, in the {\tt Library/Application Support/BEAST} directory in your home directory (e.g. {\tt /Users/joe}),
\item for Linux, in the {\tt .beast} directory in your home directory (e.g {\tt /home/joe}).
\end{itemize}
The file contains an alignment of sequences. The \texttt{RacRABV.nex} looks like this (content has been truncated):

\begin{verbatim}
#NEXUS
BEGIN DATA;
       DIMENSIONS  NTAX =47 NCHAR=2811;
       FORMAT DATATYPE = DNA GAP = - MISSING = ?;
       MATRIX   
hOH10_97.2_41.053_80.706  TTCCCTATCTACACAATACCAGAC...
hWVa01_93.2_40.505_80.575 TTCCCTATCTACACAATACCAGAC...
NY01_03.4_41.057_73.794   TTCCCTATCTACACAATACCAGAC...
NY03_03.4_42.934_76.565   TTCCCTATCTACACAATACCAGAC...
         ... ...

;
END;
\end{verbatim}

\medskip{}

Once loaded, a partition is displayed in the main panel.
You can double click any alignment (partition) to show its detail.

\begin{figure}
\begin{center}

\includegraphics[scale=0.4]{figures/BEAUti_DataPartitions}

\end{center}
\caption{\label{fig.datapartition} Data partition panel after loading alignment.}
\end{figure}

\subsubsection*{Set up dates}

We want to use tip dates for this analysis.

Select the 'Tip Dates' tab, and click the 'Use tip dates' check box.

Since we can derive the date from the taxon names, click the 'Guess' button.

A dialog pops up, where we can specify the dates as follows: the dates are encoded between underscores in the name, and it is the second part. So, we want to split on character (select this box) and take group 2.

Dates can be in the late 1990s and early 2000s, but the 19 and 20 are not in the name. So, we have to add 1900 if the date is a high number and 2000 if it is a low number (say 13). Enter this information at the bottom of the dialog, and it should now look like this:

\includegraphics[scale=0.4]{figures/BEAUti_dates.png}

Click OK and the dates are populated by the correct value. Always double check that this happened correctly and no strange outliers or wrong encodings cause any problems, of course.

\subsubsection*{Setting the substitution model}

Select the Site model tab, and change the site model to HKY, and frequency model to `empirical'.
The screen should look like this:

\includegraphics[scale=0.4,clip=true,trim=0 300 0 0]{figures/BEAUti_sitemodel.png}

\subsubsection*{Setting the clock model}

We use a strict clock, so leave the clock tab.

\subsubsection*{Priors and Operators}

Change the tree prior from Yule to Coalescent with Constant Population. The other priors are fine. The screen should look like this:

\includegraphics[scale=0.4,clip=true,trim=0 400 0 0]{figures/BEAUti_priors.png}


\subsubsection*{Setting up the geographic model}

Go to the data partitions tab, and click the '+' button at the bottom of the screen.
A dialog pops up where you can select what you want to add. Choose `Add Continuous Geography'
and click `OK'.

\includegraphics[scale=0.4]{figures/BEAUti_geography1.png}

A new window pops up where you can choose the tree where you want to add geography.
Also, you can change the trait name to `location';

\includegraphics[scale=0.4,clip=true,trim=300 200 300 200]{figures/BEAUti_geography2.png}

When you click OK, a dialog is shown where you can enter latitude and longitude for each taxon.
In this tutorial, this information is encoded in the taxon name, so we can guess it from the name. 

\includegraphics[scale=0.4]{figures/BEAUti_geography3.png}

Click `Guess latitude', and a dialog is shown. Choose `split on character' and take group 3 for the latitude.
When you click OK, the latitude field of the table is populated.

\includegraphics[scale=0.4]{figures/BEAUti_geography4.png}

Click `Guess longitude', and again a dialog is shown. Choose `split on character' and take group 4 for the longitude.

\includegraphics[scale=0.4]{figures/BEAUti_geography5.png}

When you click OK, the table is completely populated. However, longitudes are taken to be 
east instead of west, so we have to make all longitudes negative. The easiest way to do this
is to press the 'Manipulate longitude' button. An optionpane pops up where you can enter
a formula that is applied to all longitude values, like so:

\includegraphics[scale=0.4]{figures/BEAUti_geography5a.png}

Key in, '-\$x', and press OK. Now, the longitudes and latitudes are properly populated, like so:

\includegraphics[scale=0.4]{figures/BEAUti_geography6.png}

Click close, and a second data partition is created.

\includegraphics[scale=0.4]{figures/BEAUti_DataPartitions2.png}

The clock model now looks like this:

\includegraphics[scale=0.4,clip=true,trim=0 300 0 0]{figures/BEAUti_clockmodel2.png}

For the location, we select a relaxed clock with log-normal distribution.

\includegraphics[scale=0.4,clip=true,trim=0 300 0 0]{figures/BEAUti_clockmodel3.png}

The priors now look like this:

\includegraphics[scale=0.4]{figures/BEAUti_priors2.png}

To get the location edit dialog back, double click the location partition.


\subsubsection*{Setting the MCMC options }

The next tab, {\bf MCMC}, provides more general
settings to control the length of the MCMC and the file names. 

Firstly we have the \textbf{Length of chain}. This is the number of
steps the MCMC will make in the chain before finishing. The appropriate length of the chain depends on the size of the data set, the complexity of the
model and the accuracy of the answer required. The default value of 10,000,000
is entirely arbitrary and should be adjusted according to the size
of your data set. For this data set let's initially set the chain
length to 20,000,000 as this will run reasonably quickly on most modern
computers (less than 20 minutes).

The next options specify how often the parameter values in the Markov
chain should be displayed on the screen and recorded in the log file.
The screen output is simply for monitoring the programs progress so
can be set to any value (although if set too small, the sheer quantity
of information being displayed on the screen will actually slow the
program down). For the log file, the value should be set relative
to the total length of the chain. Sampling too often will result in
very large files with little extra benefit in terms of the precision
of the analysis. Sample too infrequently and the log file will not
contain much information about the distributions of the parameters. 
You probably want to aim to store no more than 10,000 samples so this should be
set to no less than chain length / 10,000.

For this exercise we will set the screen log to 100000 and the file log to 20000. The final two
options give the file names of the log files for the sampled parameters and
the trees. These will be set to a default based on the name of the
imported NEXUS file. 

\begin{figure}
\begin{center}

\includegraphics[scale=0.4]{figures/BEAUti_MCMC}

\end{center}
\caption{\label{fig.MCMC} Setting up the MCMC paremeters.}
\end{figure}


If you are using windows then we suggest you add the suffix \texttt{.txt} to both of these (so,
\texttt{gopher.log.txt} and \texttt{gopher.trees.txt}) so that Windows recognizes
these as text files. 

\subsubsection*{Generating the BEAST XML file }

We are now ready to create the BEAST XML file. To do this, either select the {\bf File/Save} or {\bf File/Save As} option from the \textbf{File} menu. Check the default priors setting and click \textbf{Continue}. Save the file with an appropriate name (we usually end the filename with \texttt{.xml}, i.e., \texttt{RacRABC.xml}). We are now ready to run the file through BEAST. 

\subsection*{Running BEAST }

Now run BEAST and when it asks for an input file, provide your newly
created XML file as input by click \textbf{Choose File ...}, and then click \textbf{Run}. 

\begin{figure}
\begin{center}

\includegraphics[scale=0.5]{figures/BEAST}

\end{center}
\caption{\label{fig.BEAST} Launching BEAST.}
\end{figure}


BEAST will then run until it has finished
reporting information to the screen. The actual results files are
saved to the disk in the same location as your input file. The output to the screen will
look something like this: 

{\scriptsize   
\begin{verbatim}

          BEAST v2.0.e Prerelease r794, 2002-2011
       Bayesian Evolutionary Analysis Sampling Trees
                 Designed and developed by
Remco Bouckaert, Alexei J. Drummond, Andrew Rambaut and Marc A. Suchard
                              
               Department of Computer Science
                   University of Auckland
                  remco@cs.auckland.ac.nz
                  alexei@cs.auckland.ac.nz
                              
             Institute of Evolutionary Biology
                  University of Edinburgh
                     a.rambaut@ed.ac.uk
                              
              David Geffen School of Medicine
           University of California, Los Angeles
                     msuchard@ucla.edu
                              
                Downloads, Help & Resources:
              	http://beast2.cs.auckland.ac.nz
                              
Source code distributed under the GNU Lesser General Public License:
              	http://code.google.com/p/beast2
                              
                     BEAST developers:
	Alex Alekseyenko, Trevor Bedford, Erik Bloomquist, Joseph Heled, 
	Sebastian Hoehna, Denise Kuehnert, Philippe Lemey, Wai Lok Sibon Li, 
	Gerton Lunter, Sidney Markowitz, Vladimir Minin, Michael Defoin Platel, 
          	Oliver Pybus, Chieh-Hsi Wu, Walter Xie
                              
                         Thanks to:
    	Roald Forsberg, Beth Shapiro and Korbinian Strimmer

File: RacRABC.xml seed: 1347411445277 threads: 1

Random number seed: 1347411445277

Skip loading file:/home/remco/workspace/beast2/lib/fest-swing-1.2.jar: contains classs org.fest.swing.annotation.GUITest that is already loaded
Loaded URL file:/home/remco/workspace/beast2/lib/fest-util-1.1.2-sources.jar
Skip loading file:/home/remco/workspace/beast2/lib/junit-4.8.2.jar: contains classs junit.extensions.ActiveTestSuite$1 that is already loaded
Loaded URL file:/home/remco/workspace/beast2/lib/fest-assert-1.2-sources.jar
Skip loading file:/home/remco/workspace/beast2/lib/fest-swing-junit-4.
... ...

\end{verbatim}}

\subsection*{Analyzing the results}

Run the program called {\bf Tracer} to analyze the output of BEAST. When the main
window has opened, choose {\bf Import Trace File...} from the {\bf File} menu and select the file that
BEAST has created called \texttt{RacRABV.123123.log} where 123123 is the seed you used.
You should now see a window like in Figure \ref{fig.tracer1}.

\begin{figure}
\begin{center}

\frame{\includegraphics[scale=0.4]{figures/Tracer}}

\end{center}
\caption{\label{fig.tracer1} Tracer with the rabbies data.}
\end{figure}


Remember that MCMC is a stochastic algorithm so the actual numbers will not be exactly the same.

On the left hand side is a list of the different quantities that BEAST has logged. There are traces for the posterior (this
is the log of the product of the tree likelihood and the prior probabilities), and the continuous parameters. Selecting a trace
on the left brings up analyses for this trace on the right hand side depending on tab that is selected. When first opened, the
`posterior' trace is selected and various statistics of this trace are shown under the Estimates tab.
In the top right of the window is a table of calculated statistics for the selected trace. 

Tracer will plot a (marginal posterior) distribution for the selected parameter and also give you statistics such as the mean and median. The \texttt{95\% HPD lower} or \texttt {upper} stands for {\it highest posterior density interval} and represents the most compact interval on the selected parameter that contains 95\% of the posterior probability. It can be thought of as a Bayesian analog to a confidence interval. 


\section*{Questions}

To determine whether a relaxed clock is supported by the data over a strict clock,
examine the coefficient of variation. What is the mean, and the shape of the distribution
of the coefficient of variation? Is a relaxed clock is appropriate for this data?

\vspace{5 mm}
\framebox(420,60){}
\vspace{5 mm}



\subsection*{Obtaining an estimate of the phylogenetic tree}

BEAST also produces a sample of plausible trees. 
These can be summarized using the program {\bf TreeAnnotator}. This will take the set of trees and identify a single tree that best represents the posterior distribution. It will then annotate this selected tree topology with the mean ages of all the
nodes as well as the 95\% HPD interval of divergence times for each clade in the selected tree. It will also calculate the posterior clade probability for each
node. Run the {\bf TreeAnnotator} program and set it up to look like in Figure \ref{fig.TreeAnnotator}.

\begin{figure}
\begin{center}

\frame{\includegraphics[scale=0.4]{figures/TreeAnnotator}}

\end{center}
\caption{\label{fig.TreeAnnotator} Using TreeAnnotator to summarise the tree set.}
\end{figure}


The burnin is the number of trees to remove from the start of the sample. Unlike {\bf Tracer} which specifies the number of steps as a burnin, in {\bf TreeAnnotator} you need to specify the actual number of trees. For this run, we use the default setting.

The {\bf Posterior probability limit} option specifies a limit such that if a node is found at less than this frequency in the sample of trees (i.e., has a posterior probability less than this limit), it will not be annotated. The default of 0.5 means that only nodes seen in the majority of trees will be annotated. Set this to zero to annotate all nodes.

For {\bf Target tree type} you can either choose a specific tree from a file or ask TreeAnnotator to find a tree in your sample. The default option, {\bf Maximum clade credibility tree}, finds the tree with the highest product of the posterior probability of all its nodes.

Choose {\bf Mean heights} for node heights. This sets the heights (ages) of each node in the tree to the mean height across the entire sample of trees for that clade.

For the input file, select the trees file that BEAST created (by default this will be called \texttt{gopher.species.trees}) and select a file for the output (here we called it \texttt{gopher.species.tree}).

Now press \texttt{Run} and wait for the program to finish.

\subsection*{Viewing the Sprecies Tree}

We can look at the tree in another program called {\bf FigTree}. Run this program, and open
the \texttt{gopher.species.tree} file by using the Open command in the File menu. The tree should appear.
You can now try selecting some of the options in the control panel on the left. Try selecting
{\bf Node Bars} to get node age error bars. Also turn on {\bf Branch Labels} and select {\bf posterior} to get
it to display the posterior probability for each node. Under {\bf Appearance} you can also tell FigTree
to colour the branches by the rate.
You should end up with something like Figure \ref{fig.figtree}.

\begin{figure}
\begin{center}

\frame{\includegraphics[scale=0.4]{figures/figtree.png}}

\end{center}
\caption{\label{fig.figtree} Figtree representation of the summary tree. Branch colours represent diffusion rates.}
\end{figure}


Alternatively, you can load the species tree set (note this is NOT the summary tree, but the complete set) into DensiTree and set it up as follows.

\begin{itemize}
\item Set burn-in to 1000. The tree should not be collapsed any more.
\item Show a root-canal tree to guide the eye. Perhaps, the intensity of the trees is not large enough, so you might want to increase the intensity by clicking the icon in the button bar. If the root canal tree has negative branch lengths, this is an indication that there is little support for the clade under the branch that goes into the wrong direction. You can experiment with a few different root canal trees to solve this problem. Potentially, you need to reorder the taxa (choose something under the Edit/Shuffle submenu) to make the root canal tree look good.
\item Show a grid, and play with the grid options to only show lines at 5 year intervals covering round numbers (that is, 2000, instead of 2001.22).
\end{itemize}

The image should look something like Figure \ref{fig.DensiTree}

\begin{figure}
\begin{center}

\frame{\includegraphics[scale=0.4]{figures/DensiTree.pdf}}

\end{center}
\caption{\label{fig.DensiTree} DensiTree representation of the species tree.}
\end{figure}

\subsection*{Post processing geography}

Start spread by double clicking the spread.jar file.

Select the 'continuous' tab.

Click the load-button and select the summary tree file.

Now, open the 'Output' tab in the panel on the left hand side. Here, you can choose where to save the kml file (default {\tt output.kml}).

Click the plot button, and a world map appears with the tree superimposed onto the area where the rabies epidemic occurred.

\frame{\includegraphics[scale=0.4]{figures/spread.png}}

The KML file can be read into google earth. Here, the spread of the epidemic can be animated through time. The coloured areas represent the 95\% HPD regions of the locations of the internal nodes of the summary tree.

\frame{\includegraphics[scale=0.4]{figures/google-earth.png}}

\bibliographystyle{plain}

\bibliography{phylogeography_c}


\end{document}
